fix: error on CREATE EXTERNAL TABLE with no files and no explicit schema #21965
adriangb wants to merge 2 commits into apache:main
Pointing CREATE EXTERNAL TABLE at an empty (or non-existent) location without an explicit column list previously produced a 0-column table. Subsequent queries against that table failed with a confusing "column not found" error far from the real cause.
Now ListingOptions::infer_schema returns a clear Plan error when the location yields no files, instructing the user to either add data files or declare an explicit schema. The existing behavior of pre-declaring an empty table with an explicit schema (for later INSERT) still works.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Narrows the schema-inference error to the case that actually confuses users: an empty or non-existent directory that returns zero files from list_all_files. Locations whose files all happen to be 0-byte continue to produce an empty inferred schema as before, preserving the "0-byte files don't crash reads" behavior that several existing tests depend on.
Also updates a few tests in datafusion/core that previously relied on empty fixture directories producing a 0-column table:
- listing_table_factory tests now write a 0-byte placeholder file matching the format extension so the glob/extension assertions still exercise the inference code path.
- read_dummy_folder and the empty-folder branch of read_from_different_file_extension now assert the new error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
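For readers skimming the commits, the narrowed behavior can be sketched roughly as below. This is a std-only illustration, not the code in this PR: FileMeta, InferOutcome, infer_schema_outcome and the error text are hypothetical stand-ins, and the real logic lives in ListingOptions::infer_schema. The point it tries to show is the ordering: the "no files" check runs on the raw listing, before 0-byte files are dropped.

```rust
// Hypothetical stand-in for the metadata returned by a directory listing.
// This is NOT DataFusion's ObjectMeta, just enough to show the decision.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    size: u64,
}

// Hypothetical outcome of schema inference when no explicit schema is given.
#[derive(Debug, PartialEq)]
enum InferOutcome {
    // At least one non-empty file: a real implementation would read these.
    InferFrom(Vec<String>),
    // Files exist but every one is 0 bytes: keep the old empty-schema result.
    EmptySchema,
}

// Assumed shape of the decision; the error text is illustrative only.
fn infer_schema_outcome(listed: &[FileMeta]) -> Result<InferOutcome, String> {
    // Check the raw listing first, so a directory of 0-byte files
    // is not mistaken for an empty directory.
    if listed.is_empty() {
        return Err(
            "no files found at location: add data files or declare an explicit schema"
                .to_string(),
        );
    }

    // Only now drop 0-byte files, which cannot contribute to the schema.
    let non_empty: Vec<String> = listed
        .iter()
        .filter(|f| f.size > 0)
        .map(|f| f.path.clone())
        .collect();

    if non_empty.is_empty() {
        Ok(InferOutcome::EmptySchema)
    } else {
        Ok(InferOutcome::InferFrom(non_empty))
    }
}

fn main() {
    let empty_dir: Vec<FileMeta> = Vec::new();
    let zero_byte_only = vec![FileMeta { path: "empty.csv".to_string(), size: 0 }];
    let mixed = vec![
        FileMeta { path: "empty.csv".to_string(), size: 0 },
        FileMeta { path: "data.csv".to_string(), size: 128 },
    ];

    println!("{:?}", infer_schema_outcome(&empty_dir));
    println!("{:?}", infer_schema_outcome(&zero_byte_only));
    println!("{:?}", infer_schema_outcome(&mixed));
}
```

Under these assumptions, only the truly empty listing errors; a location holding nothing but 0-byte files still yields an empty schema, which is what the existing tests expect.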
Related context: #21806 (comment) — that thread surfaced the same root cause from the angle of benchmark runners hitting it. The reasoning for fixing it here at the planning layer rather than in the runner is that the confusion isn't specific to benchmarks, so erroring on |
// Empty files cannot affect schema but may throw when trying to read for it
.try_filter(|object_meta| future::ready(object_meta.size > 0))
This just means we carry around memory for the ObjectMeta of zero-sized files until a couple of lines later. I think this is not a big problem.
The alternative is that we error even when 0-byte files are present. I think that's an interesting discussion: e.g. a completely empty data.csv, or hive-partitioned directories with no data. I think all of these should still require an explicit schema or error, but there are tests that check the opposite behavior:
- test_csv_empty_file — registers tests/data/empty_0_byte.csv (0 bytes, no header, no data) and runs SELECT * FROM empty.
- test_csv_multiple_empty_files — folder of 0-byte CSVs. Same situation.
- it_can_read_empty_ndjson — 0-byte JSON file. Same.
- test_read_empty_parquet — 0-byte parquet file. Same.
- test_read_partitioned_empty_parquet — partition dir with a 0-byte parquet.
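To make the trade-off concrete, here is a rough std-only sketch comparing the two policies on the kinds of locations those tests use. Policy and errors_without_schema are hypothetical names invented for this comment, and file sizes stand in for the real listing; nothing here is DataFusion API.

```rust
// Hypothetical policies for "no explicit schema" table creation.
#[derive(Clone, Copy, Debug)]
enum Policy {
    // What this PR does: error only when the listing returns zero files.
    ErrorOnNoFiles,
    // The stricter alternative: also error when every listed file is 0 bytes.
    ErrorOnNoData,
}

// Returns true if creating the table without a schema would error.
// `file_sizes` is the size in bytes of each file found at the location.
fn errors_without_schema(policy: Policy, file_sizes: &[u64]) -> bool {
    match policy {
        Policy::ErrorOnNoFiles => file_sizes.is_empty(),
        // `all` is true for an empty slice, so an empty directory errors here too.
        Policy::ErrorOnNoData => file_sizes.iter().all(|&size| size == 0),
    }
}

fn main() {
    // Illustrative locations modeled on the tests listed above.
    let cases: [(&str, &[u64]); 4] = [
        ("empty directory", &[]),
        ("single 0-byte file", &[0]),
        ("folder of 0-byte files", &[0, 0, 0]),
        ("0-byte file plus real data", &[0, 512]),
    ];

    for (name, sizes) in cases {
        println!(
            "{name}: no-files policy errors = {}, no-data policy errors = {}",
            errors_without_schema(Policy::ErrorOnNoFiles, sizes),
            errors_without_schema(Policy::ErrorOnNoData, sizes),
        );
    }
}
```

In this model the two policies differ only on locations that contain nothing but 0-byte files, which is exactly the situation the tests above exercise.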
Which issue does this PR close?
Rationale for this change
When you point CREATE EXTERNAL TABLE at an empty directory (or one that does not exist yet) without specifying an explicit column list, DataFusion silently creates a table with 0 columns. Any query against that table then fails with a confusing "column not found" / "no such column" error that gives no hint that the underlying issue is actually that schema inference had nothing to look at.
This is the same root cause as the discussion on #21806 (comment). That thread covered it from the angle of benchmark runners hitting it, but the confusion is not specific to benchmarks. Failing at CREATE EXTERNAL TABLE time with a clear, actionable message seemed like the right fix overall.
What changes are included in this PR?
ListingOptions::infer_schema now returns a Plan error when the location yields no files (after the existing 0-byte filter), telling the user to either add data files or declare an explicit schema.
Pre-declaring an empty table with an explicit schema (e.g. CREATE EXTERNAL TABLE t(x int) STORED AS PARQUET LOCATION '...' for later INSERT) still works: the inference path is only triggered when no schema is provided.
Are these changes tested?
Yes. New cases in datafusion/sqllogictest/test_files/ddl.slt cover the new Plan error.
Are there any user-facing changes?
Yes: CREATE EXTERNAL TABLE ... LOCATION '<empty-dir>' without an explicit schema now errors at planning time instead of creating a 0-column table. Anyone relying on the previous behavior must add an explicit schema declaration. The error message tells them how.
Use of AI
This code was written fully by AI. @adriangb gave it a detailed plan and reviewed the code by hand once this PR was opened and CI was green.